visual language models AI News List | Blockchain.News

List of AI News about visual language models

2025-11-26 11:09
Chain-of-Visual-Thought (COVT): Revolutionizing Visual Language Models with Continuous Visual Tokens for Enhanced Perception

According to @godofprompt, the new research paper 'Chain-of-Visual-Thought (COVT)' introduces a method that lets Visual Language Models (VLMs) reason with continuous visual tokens rather than traditional text-only chains of thought. The approach has the model generate mid-thought visual latents such as segmentation cues, depth maps, edges, and DINO features, effectively giving it a 'visual scratchpad' for spatial and geometric reasoning. The reported results are significant: COVT models achieved a 14% improvement in depth reasoning, a 5.5% boost on CV-Bench, and major gains on the HRBench and MMVP benchmarks. The technique is compatible with leading VLMs such as Qwen2.5-VL and LLaVA, and its visual tokens can be decoded for interpretability and transparency. Notably, the research finds that traditional text-only reasoning chains actually degrade visual reasoning performance, whereas COVT's visual grounding improves accuracy in counting, spatial understanding, and 3D awareness while reducing hallucinated outputs. These findings point to business opportunities for AI solutions that require fine-grained visual analysis, accurate object recognition, and reliable spatial intelligence, especially in fields like robotics, autonomous vehicles, and advanced multimodal search. (Source: @godofprompt, Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens, 2025)
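To make the 'visual scratchpad' idea concrete, the sketch below shows one way a reasoning step could interleave a continuous visual token (decodable into a coarse dense cue such as a depth map) with the text stream. This is an illustrative toy based on the description above, not the authors' implementation; every module name, layer, and tensor shape here is an assumption.

```python
# Toy sketch of the Chain-of-Visual-Thought idea described above (not the paper's code):
# instead of emitting only text tokens, a reasoning step also emits a continuous
# "visual thought" token that can be decoded into an interpretable dense cue
# (e.g. a coarse depth map) and fused back into the context before the next token.
# All names and shapes are illustrative assumptions.

import torch
import torch.nn as nn

class VisualThoughtHead(nn.Module):
    """Projects a hidden state into a continuous visual token and decodes it
    into a small dense map, mimicking the interpretable/decodable latents."""
    def __init__(self, hidden_dim: int, token_dim: int, map_hw: int = 16):
        super().__init__()
        self.to_visual_token = nn.Linear(hidden_dim, token_dim)
        self.decode_to_map = nn.Linear(token_dim, map_hw * map_hw)  # transparency hook
        self.map_hw = map_hw

    def forward(self, hidden_state: torch.Tensor):
        v_tok = self.to_visual_token(hidden_state)              # continuous visual token
        cue = self.decode_to_map(v_tok).view(-1, self.map_hw, self.map_hw)
        return v_tok, cue

class ToyCoVTStep(nn.Module):
    """One reasoning step: mix the running context with a fresh visual token
    (the 'visual scratchpad') before predicting the next text token."""
    def __init__(self, hidden_dim: int = 256, token_dim: int = 64, vocab: int = 1000):
        super().__init__()
        self.visual_head = VisualThoughtHead(hidden_dim, token_dim)
        self.fuse = nn.Linear(hidden_dim + token_dim, hidden_dim)
        self.lm_head = nn.Linear(hidden_dim, vocab)

    def forward(self, context: torch.Tensor):
        v_tok, cue = self.visual_head(context)                  # mid-thought visual latent
        fused = torch.tanh(self.fuse(torch.cat([context, v_tok], dim=-1)))
        logits = self.lm_head(fused)                            # next text-token prediction
        return logits, cue

step = ToyCoVTStep()
ctx = torch.randn(2, 256)                                       # batch of 2 running contexts
logits, cue = step(ctx)
print(logits.shape, cue.shape)                                  # (2, 1000) and (2, 16, 16)
```

The decoded `cue` stands in for the interpretable visual tokens the paper reports (depth, edges, segmentation, DINO-like features); in a real VLM the fusion would happen inside the transformer rather than in a single linear layer.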

2025-10-10 10:55
Outstanding Paper Award for BAIR's Analysis of Visual Language Models at COLM2025

According to @berkeley_ai, researchers from the Berkeley AI Research (BAIR) lab led by @trevordarrell received the Outstanding Paper Award at #COLM2025 for their work titled 'Hidden in plain sight: VLMs overlook their visual representations.' This paper reveals that many visual language models (VLMs) fail to fully utilize their internal visual representations, leading to missed opportunities for improved performance in AI-powered image understanding and multimodal applications (Source: @berkeley_ai, 2025-10-10). This discovery has significant implications for the AI industry, highlighting a critical area for model optimization and new business opportunities in enhancing VLM architectures for sectors like e-commerce, healthcare, and autonomous systems.
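As a rough illustration of the kind of gap the paper describes, the toy sketch below compares a simple linear probe trained directly on frozen visual-encoder features against the answers a VLM produces for the same visual property. This is not BAIR's code or data; the probe-versus-model comparison is one common way to test whether information already present in the visual representations is being used downstream, and everything here is a synthetic stand-in.

```python
# Toy sketch (assumptions, not the paper's code): if a simple probe on the frozen
# visual features outperforms the full VLM on a vision-centric question, the model
# is under-using information that is already "hidden in plain sight" in its encoder.

import torch
import torch.nn as nn

torch.manual_seed(0)

num_images, feat_dim, num_classes = 512, 768, 10
visual_feats = torch.randn(num_images, feat_dim)       # stand-in for frozen vision-encoder features
labels = torch.randint(0, num_classes, (num_images,))  # stand-in ground-truth visual property

# Linear probe trained directly on the visual representations.
probe = nn.Linear(feat_dim, num_classes)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(probe(visual_feats), labels)
    loss.backward()
    opt.step()
probe_acc = (probe(visual_feats).argmax(-1) == labels).float().mean()

# Stand-in for the VLM's text answers mapped to the same label space
# (in practice these would come from prompting the full model).
vlm_answers = torch.randint(0, num_classes, (num_images,))
vlm_acc = (vlm_answers == labels).float().mean()

print(f"probe accuracy: {probe_acc:.2f}  vs  VLM accuracy: {vlm_acc:.2f}")
```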
